
SYCL: reduce allocation overhead during flash attention#22732

Merged
ggerganov merged 6 commits into ggml-org:master from sanmai:fa-overhead-sycl
May 9, 2026

Conversation

@sanmai
Contributor

@sanmai sanmai commented May 5, 2026

Fixes #22585

Overview

I found that flash attention allocated quite a few K/V buffers with little reuse, and they remained in the legacy pool until teardown. A better strategy is to allocate the FA buffers outside the common pool and grow them on demand, so that FA holds at most the largest buffers it actually needs.

Arguably there are more optimal strategies, since the buffers stay occupied once grown. That said, the legacy pool has no eviction either, so at the very least we end up in a better spot.

  • To reduce the number of allocations, buffers grow in chunks of 16 MiB.
  • The requests grow like this: 0.5, 1.0, 1.5, 2.0, 2.5... but that could be model-dependent.
  • FA completes before ggml_sycl_fattn_alloc::alloc is called again, so no queue sync should be needed, but I added one anyway to be extra safe.
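The grow-on-demand strategy above can be sketched as follows. This is a minimal standalone illustration, not the PR's actual code: `fattn_kv_buffer` and its members are hypothetical names, and plain `new[]` stands in for `sycl::malloc_device` plus the queue synchronization mentioned in the last bullet.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a grow-on-demand buffer: capacity only ever
// increases, rounded up to 16 MiB chunks to limit the allocation count.
struct fattn_kv_buffer {
    static constexpr size_t CHUNK = 16 * 1024 * 1024; // 16 MiB

    uint8_t * data     = nullptr;
    size_t    capacity = 0; // bytes

    // Round a byte count up to the next chunk boundary.
    static size_t round_up(size_t n) {
        return ((n + CHUNK - 1) / CHUNK) * CHUNK;
    }

    // Ensure at least n bytes; reallocate only when the request exceeds
    // the current capacity. The real code would use sycl::malloc_device
    // and sync the queue before freeing the old allocation.
    uint8_t * ensure(size_t n) {
        if (n > capacity) {
            delete[] data;
            capacity = round_up(n);
            data     = new uint8_t[capacity];
        }
        return data;
    }

    ~fattn_kv_buffer() { delete[] data; }
};
```

With this policy, a request sequence like 80, 90, 100 MiB triggers only two allocations (96 MiB, then 112 MiB) instead of three.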

Additional information

Memory benchmarks with B60 and Qwen3.6-35B-A3B UD-Q4_K_M, q4_0:

Baseline (without buffers)

Smaller prompt:

~ggml_sycl_pool_leg: 9 buffers, cached = 181.67 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/50.40/8.40/50.40

Larger prompt:

~ggml_sycl_pool_leg: 23 buffers, cached = 1170.24 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/50.40/85.05/8.40/50.40/53.02/56.17/59.32/62.47/65.62/67.20/69.30/70.88/72.97/74.55/76.65/78.75/80.85/82.95

With buffers

Smaller prompt:

ggml_sycl_fattn_kv_buffer[0]: 16.00 MiB
ggml_sycl_fattn_kv_buffer[0]: 16.00 MiB
~ggml_sycl_pool_leg: 7 buffers, cached = 80.87 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/8.40

68.80 MiB savings compared to the baseline.

Larger prompt:

ggml_sycl_fattn_kv_buffer[0]: 96.00 MiB
ggml_sycl_fattn_kv_buffer[0]: 96.00 MiB
~ggml_sycl_pool_leg: 7 buffers, cached = 80.87 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/8.40

That's 272.87 MiB total for the pool plus the two FA buffers, or 897.37 MiB savings compared to the baseline. In a memory-constrained environment that could be a deal breaker.

The change shows up in the common memory breakdown just as one would expect.

 | memory breakdown [MiB]                        | total   free     self   model   context   compute    unaccounted |
-|   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 =  854 + (20003 = 18727 +     782 +     493) +        2398 |
+|   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 = 1129 + (20643 = 18727 +    1422 +     493) +        1483 |
 |   - Host                                      |                  2786 =  2522 +       0 +     264                |

Discussion

From what I can tell, other backends, such as CUDA, do not have this problem. Specifically, CUDA uses a VMM pool. I briefly considered adding ggml_sycl_pool_vmm as suggested by one of the TODOs in the code, but quickly stumbled into allocation granularity issues.
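The granularity issue can be illustrated with simple arithmetic. VMM-style reservations must be multiples of the device's allocation granularity (queried via `cuMemGetAllocationGranularity` in the CUDA driver API; finding the SYCL equivalent is part of the problem). The sketch below is hypothetical and assumes a common 2 MiB granule; the point is that many small pool slots each burn a whole granule.

```cpp
#include <cstddef>

// Assumed granule size for illustration; the real value must be queried
// from the driver and varies by device.
constexpr size_t GRANULARITY = 2 * 1024 * 1024; // 2 MiB

// Every VMM mapping is rounded up to a whole number of granules.
size_t reserve_size(size_t requested) {
    return ((requested + GRANULARITY - 1) / GRANULARITY) * GRANULARITY;
}
```

For example, the tiny 0.01 MiB slot in the baseline pool dump above would still consume a full 2 MiB granule under a naive VMM mapping.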

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES; I used them in assistive capacity for profiling, prototyping, code review.

@github-actions github-actions Bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels May 5, 2026
Comment thread ggml/src/ggml-sycl/common.hpp Outdated

sycl::half * alloc(size_t n_elems) {
ptr = buf.ensure_half(n_elems);
return ptr;
Contributor Author

@sanmai sanmai May 5, 2026


The calling code does not use the return value, but pool_alloc::alloc returns one, so I keep doing that here too, to reduce the surprise if someone later changes the old code to use the return value.

-ggml_sycl_pool_alloc<sycl::half> K_f16(pool);
-ggml_sycl_pool_alloc<sycl::half> V_f16(pool);
+ggml_sycl_fattn_alloc K_f16(fbuf.K);
+ggml_sycl_fattn_alloc V_f16(fbuf.V);
Contributor Author

@sanmai sanmai May 5, 2026


I considered adding a no-op template to make it look more uniform:

ggml_sycl_fattn_alloc<sycl::half>   K_f16(fbuf.K);
ggml_sycl_fattn_alloc<sycl::half>   V_f16(fbuf.V);
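Such a no-op template could look like the sketch below. This is hypothetical, not the merged code: `fattn_buffer` and `ensure_bytes` are stand-in names, and the parameter `T` exists only so call sites line up visually with `ggml_sycl_pool_alloc<T>`.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the real FA buffer (grow-on-demand, host
// malloc instead of device memory for the sake of a runnable sketch).
struct fattn_buffer {
    void * data = nullptr;
    size_t cap  = 0;

    void * ensure_bytes(size_t n) {
        if (n > cap) {
            std::free(data);
            data = std::malloc(n);
            cap  = n;
        }
        return data;
    }

    ~fattn_buffer() { std::free(data); }
};

// T is a no-op: it only affects the element type of the returned
// pointer, mirroring the ggml_sycl_pool_alloc<T> interface.
template <typename T>
struct ggml_sycl_fattn_alloc {
    fattn_buffer & buf;
    T *            ptr = nullptr;

    explicit ggml_sycl_fattn_alloc(fattn_buffer & b) : buf(b) {}

    T * alloc(size_t n_elems) {
        ptr = static_cast<T *>(buf.ensure_bytes(n_elems * sizeof(T)));
        return ptr;
    }
};
```

The trade-off is purely cosmetic: the template adds no behavior, it just keeps the two call sites visually symmetric with the pool-based allocator they replace.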

@sanmai sanmai marked this pull request as ready for review May 5, 2026 23:05
@sanmai sanmai requested a review from a team as a code owner May 5, 2026 23:05
Contributor

@arthw arthw left a comment


Good job!
I tested it and the memory usage is reduced. It's great for users.

comments:

  1. DEBUG_SYCL_POOL
    Please update the description in SYCL.md.

  2. The feature is about flash attention, so please move the code into fattn-common.hpp or fattn-xxx.cpp/hpp. Flash attention is a big feature; I suggest moving all of its code into the fattn-xxx files.

common.hpp is for more common code, and ggml-sycl.cpp is for the base API of the SYCL backend.

Thank you!

@sanmai
Contributor Author

sanmai commented May 8, 2026

I appreciate the review! Should be ready for another round.

@sanmai sanmai requested a review from arthw May 8, 2026 06:58
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 8, 2026
Comment thread ggml/src/ggml-sycl/fattn-buffers.cpp
Contributor

@arthw arthw left a comment


Good job!

Thank you!

@arthw arthw added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label May 9, 2026
@ggerganov
Member

From what I can tell, other backends, such as CUDA, do not have this problem. Specifically, CUDA uses a VMM pool. I briefly considered adding ggml_sycl_pool_vmm as suggested by one of the TODOs in the code, but quickly stumbled into allocation granularity issues.

Would the VMM solution be re-considered in the future? Maybe add TODOs?

@sanmai
Contributor Author

sanmai commented May 9, 2026

VMM is next once this goes through (the guidance says at most 1 PR in flight for new contributors). That said, this approach still decreases the unaccounted memory usage even together with VMM, so they aren't exclusive.

- without vmm pool:
+ with vmm:
 common_memory_breakdown_print: | memory breakdown [MiB]                        | total   free     self   model   context    compute    unaccounted |
-common_memory_breakdown_print: |   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 =   56 + (21100 = 16104 +    4501 +     495) +        2098 |
+common_memory_breakdown_print: |   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 =  223 + (21100 = 16104 +    4501 +     495) +        1931 |
 common_memory_breakdown_print: |   - Host                                      |                   958 =   682 +       0 +     276                |

So VMM improves things (1931 vs 2098 without) but not as much as the current approach (1483 vs 2398).

And there are many other uses for VMM.

@ggerganov ggerganov merged commit e20b839 into ggml-org:master May 9, 2026
49 checks passed
@sanmai sanmai deleted the fa-overhead-sycl branch May 9, 2026 06:30

Labels

documentation: Improvements or additions to documentation
ggml: changes relating to the ggml tensor library for machine learning
merge ready: A maintainer can use this label to indicate that they consider the changes final and ready to merge.
SYCL: https://en.wikipedia.org/wiki/SYCL - GPU programming language


Development

Successfully merging this pull request may close these issues.

SYCL: flash-attention buffers are retained across long-context ubatches causing linear VRAM growth

3 participants